In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import svm
The dataset used in this lab comes from https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
At face value, this looks like an easy lab, but it has many parts to it, so prepare yourself by reading through it fully before starting.
In [ ]:
def load(path_train, path_test):
    # Load up the data.
    # You probably could have written this easily.
    # The optdigits files have no header row, hence header=None:
    with open(path_test, 'r') as f: testing = pd.read_csv(f, header=None)
    with open(path_train, 'r') as f: training = pd.read_csv(f, header=None)

    # The number of samples between training and testing can vary,
    # but the number of features had better remain the same!
    n_features = testing.shape[1]

    X_test = testing.iloc[:, :n_features-1]
    X_train = training.iloc[:, :n_features-1]
    y_test = testing.iloc[:, n_features-1:].values.ravel()
    y_train = training.iloc[:, n_features-1:].values.ravel()

    # Special:
    # ...

    return X_train, X_test, y_train, y_test
In [ ]:
def peekData(X_train):
    # The 'targets' or labels are stored in y. The 'samples' or data is stored in X
    print("Peeking your data...")
    fig = plt.figure()
    fig.set_tight_layout(True)

    cnt = 0
    for col in range(5):
        for row in range(10):
            plt.subplot(5, 10, cnt + 1)
            plt.imshow(X_train.iloc[cnt, :].values.reshape(8, 8),
                       cmap=plt.cm.gray_r, interpolation='nearest')
            plt.axis('off')
            cnt += 1
    plt.show()
In [ ]:
def drawPredictions(X_train, X_test, y_train, y_test):
    fig = plt.figure()
    fig.set_tight_layout(True)

    # Make some guesses (model is the SVC you train further down in the notebook)
    y_guess = model.predict(X_test)

    # INFO: This is the second lab where we demonstrate how to do
    # multi-plots using matplotlib. In the next assignment(s), it'll be
    # your responsibility to use this and assignment #1 as tutorials to
    # add in the plotting code yourself!
    num_rows = 10
    num_cols = 5

    index = 0
    for col in range(num_cols):
        for row in range(num_rows):
            plt.subplot(num_cols, num_rows, index + 1)

            # 8x8 is the size of the image, 64 pixels
            plt.imshow(X_test.iloc[index, :].values.reshape(8, 8),
                       cmap=plt.cm.gray_r, interpolation='nearest')

            # Green = Guessed right
            # Red = Fail!
            fontcolor = 'g' if y_test[index] == y_guess[index] else 'r'
            plt.title('Label: %i' % y_guess[index], fontsize=6, color=fontcolor)
            plt.axis('off')
            index += 1
    plt.show()
In [ ]:
# TODO: Pass in the file paths to the .tra and the .tes files:
X_train, X_test, y_train, y_test = load('', '')
Get to know your data. It seems it's already well organized in [n_samples, n_features] form, with 64 features per sample (each sample is an 8x8 image). Also, your labels are already shaped as [n_samples].
In [ ]:
peekData(X_train)
Create an SVC classifier. Leave C=1, but set gamma to 0.001 and set the kernel to linear. Then train the model on the training data and labels:
In [ ]:
print("Training SVC Classifier...")
# .. your code here ..
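The training step above can be sketched as follows. Since the optdigits file paths aren't filled in here, this sketch substitutes scikit-learn's bundled 8x8 digits via `load_digits()` as stand-in data; with the real lab, use the `X_train`/`y_train` returned by `load()` instead.

```python
from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Stand-in data: sklearn's bundled 8x8 digits (the real lab uses the
# optdigits .tra/.tes files loaded by load()).
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=7)

print("Training SVC Classifier...")
# C=1 is the default; gamma is ignored by the linear kernel but is set
# here anyway to match the lab text.
model = svm.SVC(C=1, kernel='linear', gamma=0.001)
model.fit(X_train, y_train)
```

Note that `gamma` only affects the 'rbf', 'poly' and 'sigmoid' kernels, so with `kernel='linear'` it is silently ignored.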
Calculate the score of your SVC against the testing data:
In [ ]:
print("Scoring SVC Classifier...")
# .. your code here ..
print("Score:\n", score)
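Scoring is a single call to the estimator's `score()` method, which returns mean accuracy. A minimal sketch, again using `load_digits()` as stand-in data:

```python
from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()  # stand-in for the optdigits data
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=7)

model = svm.SVC(C=1, kernel='linear', gamma=0.001).fit(X_train, y_train)

print("Scoring SVC Classifier...")
# score() returns the mean accuracy on the given test data and labels.
score = model.score(X_test, y_test)
print("Score:\n", score)
```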
In [ ]:
# Let's get some visual confirmation of accuracy:
drawPredictions(X_train, X_test, y_train, y_test)
Print out the TRUE value of the 1000th digit in the test set. By TRUE value, we mean the actual, provided ground-truth label for that sample:
In [ ]:
# .. your code here ..
print("1000th test label: ", true_1000th_test_value)
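Getting the true label is just indexing into `y_test`. A sketch using `load_digits()` as stand-in data (in the real lab, `y_test` comes from `load()`); the 1000th sample lives at zero-based index 999:

```python
from sklearn.datasets import load_digits

digits = load_digits()
y_test = digits.target  # stand-in for the lab's optdigits test labels

# The 1000th sample is at zero-based index 999.
true_1000th_test_value = y_test[999]
print("1000th test label: ", true_1000th_test_value)
```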
Predict the value of the 1000th digit in the test set. Was your model's prediction correct? If you get a warning on your predict line, look at the notes from the previous module's labs.
In [ ]:
# .. your code here ..
print("1000th test prediction: ", guess_1000th_test_value)
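The prediction is one `predict()` call on a single row. A sketch with `load_digits()` as stand-in data; the model here is fit on the same stand-in array purely to keep the sketch short, whereas in the lab you use the SVC you already trained:

```python
from sklearn import svm
from sklearn.datasets import load_digits

digits = load_digits()
X_test, y_test = digits.data, digits.target  # stand-ins for the lab's test set
model = svm.SVC(C=1, kernel='linear', gamma=0.001).fit(X_test, y_test)

# predict() expects a 2D array -- one row per sample -- so reshape the
# single row to (1, n_features); passing a 1D row is what triggers the
# warning mentioned above. With the lab's DataFrame, X_test.iloc[[999]]
# achieves the same thing.
guess_1000th_test_value = model.predict(X_test[999].reshape(1, -1))[0]
print("1000th test prediction: ", guess_1000th_test_value)
```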
Use imshow() to display the 1000th test image, so you can visually check whether it was a hard image or an easy image:
In [ ]:
# .. your code here ..
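A sketch of the display step, assuming `X_test` is a NumPy array (with the lab's DataFrame you would take `X_test.iloc[999].values` first). The `Agg` backend is selected only so this runs headless; drop that line in a notebook:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

digits = load_digits()
X_test = digits.data  # stand-in for the lab's optdigits test set

# Each sample is 64 pixels, so reshape it back to the original 8x8 grid.
img = X_test[999].reshape(8, 8)
plt.imshow(img, cmap=plt.cm.gray_r, interpolation='nearest')
plt.title('1000th test image')
plt.show()
```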
Only after you're able to beat the +98% accuracy score of the USPS, go back into the load() method and look for the line that reads # Special:
Immediately under that line, ONLY alter X_train and y_train. Keep just the FIRST 4% of the samples. In other words, for every 100 samples found, throw away 96 of them. To make this easy, keep the samples and labels from the beginning of your X_train and y_train vectors.
If the first 4% of your train vector's size yields a decimal number, use ceil to round up to the nearest whole integer.
This operation might require some Pandas indexing skills, or rather some numpy indexing skills, if you'd like to go that route. Feel free to ask on the class forum if you'd like a tip on how to do this; but try to exercise your own muscles first!
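One way to do the slice, using math.ceil plus plain iloc/array indexing. The 103-row DataFrame below is a hypothetical stand-in, chosen so the rounding is visible: 4% of 103 is 4.12, which ceil rounds up to 5.

```python
import math
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the lab's X_train / y_train:
X_train = pd.DataFrame(np.zeros((103, 64)))
y_train = np.zeros(103, dtype=int)

# Keep only the FIRST 4% of the training samples, rounding up.
n_keep = int(math.ceil(X_train.shape[0] * 0.04))
X_train = X_train.iloc[:n_keep]
y_train = y_train[:n_keep]

print(X_train.shape, y_train.shape)
```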
Re-run your application after throwing away 96% of your training data. What accuracy score do you get now?
Change your kernel back to linear and run your assignment one last time. What's the accuracy score this time?
Surprised?
In [ ]: